Systems and methods for autodirecting a real-time transmission

ABSTRACT

In some aspects, the described systems and methods provide for a system for processing a stream for real-time transmission. The system comprises a processor in communication with memory. The processor is configured to execute instructions for an autodirection component stored in memory that cause the processor to receive a real-time stream for an artistic performance, detect one or more human persons in the real-time stream, rank the detected one or more human persons in the real-time stream, select, based on the ranking, a subject from the detected one or more human persons, determine a subject framing for the real-time stream based on the selected subject, process the real-time stream to select a portion of each frame in the real-time stream according to the subject framing, wherein the portion of each frame includes at least the subject, and transmit the processed stream in real-time.

CROSS-REFERENCE TO RELATED APPLICATIONS

This Application is a Continuation of U.S. patent application Ser. No. 16/399,387, filed Apr. 30, 2019, entitled “SYSTEMS AND METHODS FOR AUTODIRECTING A REAL-TIME TRANSMISSION,” which is a Non-Provisional of Provisional (35 U.S.C. § 119(e)) of U.S. Provisional Application Ser. No. 62/664,640, filed Apr. 30, 2018, entitled “SYSTEMS AND METHODS FOR AUTODIRECTING A REAL-TIME TRANSMISSION.” The entire contents of these applications are incorporated herein by reference in their entirety.

NOTICE OF MATERIAL SUBJECT TO COPYRIGHT PROTECTION

Portions of the material in this patent document are subject to copyright protection under the copyright laws of the United States and of other countries. The owner of the copyright rights has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the United States Patent and Trademark Office publicly available file or records, but otherwise reserves all copyright rights whatsoever. The copyright owner does not hereby waive any of its rights to have this patent document maintained in secrecy, including without limitation its rights pursuant to 37 C.F.R. § 1.14.

BACKGROUND

Conventional approaches to producing a live stream of an event require multiple pieces of expensive camera equipment to capture video and/or audio and a production crew to direct the video and/or audio before being streamed or transmitted to an end user. Typically, only live events with a major regional, national, or international interest are streamed or transmitted live due to the cost prohibitive nature of the required camera equipment and production crew.

SUMMARY

Conventional approaches to producing a live stream lack a scalable, turn-key mechanism to automatically capture and direct a stream of a live event. Conventional approaches are cost prohibitive for many live events, such as local music concerts and comedy shows, that an end user may be interested in viewing. Moreover, the conventional approaches typically require a production crew to review captured video and/or audio feeds and decide which feed to include in the stream or transmission. Lastly, for live events such as local music concerts and comedy shows, an end user typically relies on receiving a link to the live stream in order to view the live event. The end user does not have access to a centralized location of such available live streams, as they are typically scattered over multiple sources and difficult to find.

In some aspects, the described systems and methods provide for an autodirection system that receives automatically captured live video and/or audio feeds of a live event from permanently or temporarily installed cameras at a venue of the live event. The autodirection system directs and/or edits the live video and/or audio feeds in real time based on one or more metrics. The autodirection system generates a real-time transmission of the event for an end user. In some embodiments, live video feeds from each camera are sent to virtual observers (also referred to as observers) which apply one or more metrics to score each feed. For example, the virtual observers may receive the streams and analyze them based on audio and video metrics. In some embodiments, the scored metrics are input to a decision making engine which decides whether to switch to another feed or maintain the current feed and/or change the zoom on the selected feed. The decision making engine may use one or more machine learning algorithms or a rule-based system or a combination thereof to make decisions.

In some embodiments, the described systems and methods provide for receiving a series of audio and video inputs from a live performance, analyzing the inputs in real-time, and making decisions on how to direct and produce a real-time transmission of a given performance. The decision making engine may use scoring generated by analysis metrics and determine what video stream to use and/or where to crop/zoom in that stream. Once these decisions have been made, a real-time transmission may be rendered for delivery to end-users. In some embodiments, the autodirection system may receive different kinds of data and media sources as inputs. One group of inputs may be audio feeds, including the house mix from a concert venue. The audio feeds may also include individual audio feeds from different performers and/or microphones positioned around the venue. A second group of inputs may be raw, high quality video streams from one or more cameras mounted around the venue. Other types of inputs may include data from spatial tracking systems that transmit the location of specific performers on stage or motion tracking systems that relay information about the audience. The audio/video type inputs may be pulled into the autodirection system via a multiplexing card or piece of stand-alone hardware that receives the streams from the venue's mixing board and cameras and relays them in a format that the autodirection system can understand. Spatial and motion-related inputs may be transmitted to the autodirection system wirelessly, via a direct connection over Ethernet, or over another suitable medium.

The described systems and methods are advantageous over conventional approaches because they do not require expensive camera equipment or a production crew to produce the real-time transmission. The direction and/or editing of the feeds of the live event is performed in real time and the end user receives the real-time transmission as soon as it is generated. It is noted that there may be minimal delay due to time taken to generate and transmit the real-time transmission. Moreover, the real-time transmission may omit one or more frames of a portion of interest as the real-time transmission is generated and transmitted. For example, the system may determine that a “drum solo” has begun in a live music event and switch to a live feed capturing the “drum solo.” However, due to the real-time nature of the transmission, a few frames from the beginning of the “drum solo” may not be streamed while the autodirection system makes the switch to the appropriate live feed and generates the real-time transmission.

In some aspects, the described systems and methods provide for a system for selecting a stream for real-time transmission. The system comprises an autodirection component. The autodirection component is adapted to receive one or more real-time streams. The autodirection component is further adapted to synchronize each received stream to a current time. The autodirection component is further adapted to score each received stream with respect to one or more metrics. The autodirection component is further adapted to select, based on the scored metrics, a stream from the one or more real-time streams for real-time transmission. The autodirection component is further adapted to transmit the selected stream in real-time.

In some embodiments, the autodirection component is further adapted to zoom and/or pan, based on the scored metrics, to a portion of the selected stream.

In some embodiments, the autodirection component is further adapted to switch a currently transmitted stream to the selected stream for real-time transmission. In some embodiments, the switching is performed at a slower rate or a faster rate based on the scored metrics.

In some embodiments, the autodirection component is further adapted to convert the one or more real-time streams from a first format to a second format.

In some embodiments, the one or more real-time streams includes a video stream and/or an audio stream.

In some embodiments, at least one of the one or more real-time streams is received from one or more cameras, the one or more cameras including a permanently installed camera, a temporarily installed camera, and/or a mobile phone camera.

In some embodiments, the one or more metrics relates to a video stream and includes motion tracking, vocalist identification, and/or instrumentalist identification.

In some embodiments, the one or more metrics relates to an audio stream and includes voice detection, instrument detection, onset detection, intensity, larm, loudness, beat tracking, and/or danceability.

In some embodiments, a first set of metrics is associated with a first user and a second set of metrics is associated with a second user. Further, a first stream from the one or more real-time streams is selected for real-time transmission to the first user based on the scored first set of metrics. Further, a second stream from the one or more real-time streams is selected for real-time transmission to the second user based on the scored second set of metrics.

In some aspects, the described systems and methods provide for a computer implemented method for selecting a stream for real-time transmission. The method comprises the act of receiving one or more real-time streams. The method further comprises the act of synchronizing each received stream to a current time. The method further comprises the act of scoring each received stream with respect to one or more metrics. The method further comprises the act of selecting, based on the scored metrics, a stream from the one or more real-time streams for real-time transmission. The method further comprises the act of transmitting the selected stream in real-time.

In some embodiments, the method further comprises the act of zooming and/or panning, based on the scored metrics, to a portion of the selected stream.

In some embodiments, the method further comprises the act of switching a currently transmitted stream to the selected stream for real-time transmission. In some embodiments, the switching is performed at a slower rate or a faster rate based on the scored metrics.

In some embodiments, the method further comprises the act of converting the one or more real-time streams from a first format to a second format.

In some embodiments, the one or more real-time streams includes a video stream and/or an audio stream.

In some embodiments, at least one of the one or more real-time streams is received from one or more cameras, the one or more cameras including a permanently installed camera, a temporarily installed camera, and/or a mobile phone camera.

In some embodiments, the one or more metrics relates to a video stream and includes motion tracking, vocalist identification, and/or instrumentalist identification.

In some embodiments, the one or more metrics relates to an audio stream and includes voice detection, instrument detection, onset detection, intensity, larm, loudness, beat tracking, and/or danceability.

In some embodiments, a first set of metrics is associated with a first user and a second set of metrics is associated with a second user. Further, a first stream from the one or more streams is selected for real-time transmission to the first user based on the scored first set of metrics. Further, a second stream from the one or more streams is selected for real-time transmission to the second user based on the scored second set of metrics.

In some aspects, the described systems and methods provide for a system for processing a stream for real-time transmission. The system comprises a processor in communication with memory. The processor is configured to execute instructions for an autodirection component stored in memory that cause the processor to receive a real-time stream for an artistic performance, detect one or more human persons in the real-time stream, rank the detected one or more human persons in the real-time stream, select, based on the ranking, a subject from the detected one or more human persons, determine a subject framing for the real-time stream based on the selected subject, process the real-time stream to select a portion of each frame in the real-time stream according to the subject framing, wherein the portion of each frame includes at least the subject, and transmit the processed stream in real-time.

In some embodiments, the detected one or more human persons are ranked based on proximity to a camera that captures the real-time stream.

In some embodiments, the detected one or more human persons are ranked based on determining which human person is singing in the artistic performance.

In some embodiments, the detected one or more human persons are ranked based on proximity to a center of each frame in the real-time stream.

In some embodiments, determining the subject framing comprises determining that one of the human persons is singing and selecting that human person as the only subject for the portion of each frame.

In some embodiments, determining the subject framing comprises determining that none of the human persons is singing and selecting two or more human persons as the subjects for the portion of each frame, wherein the portion of each frame includes both the subjects.

In some embodiments, the real-time stream is captured from one or more cameras including a left camera, a right camera, and/or a center camera.

In some embodiments, a second real-time stream for the artistic performance is not further analyzed subsequent to detecting no human person in the second real-time stream.

In some embodiments, a second real-time stream for the artistic performance is analyzed using one or more backup rules subsequent to detecting no human person in the second real-time stream.

In some embodiments, processing the real-time stream further includes selecting a zoom level for selecting the portion of each frame of the real-time stream.

In some embodiments, processing the real-time stream further includes selecting a first zoom level for some frames of the real-time stream and a second zoom level for remaining frames of the real-time stream.

In some embodiments, the system determines a penalty based on a quality of the real-time stream.

In some embodiments, the quality of the real-time stream includes whether the subject is trackable, whether the subject is out of frame, and/or whether there is noise in detection of the subject.

In some embodiments, a distance of the subject from a camera capturing the real-time stream is determined based on a size of the head of the subject.

In some embodiments, detecting one or more human persons in the real-time stream includes detecting a human body and/or one or more mandatory parts.

In some embodiments, the one or more mandatory parts include an eye, an elbow, and a shoulder.

In some embodiments, the portion of each frame is selected based on maintaining a minimum margin between the head of the subject and an edge of the portion of the frame.

In some embodiments, a second real-time stream from a different camera is selected based on a threshold time passing subsequent to an initial transmission of the processed stream.

In some embodiments, a second real-time stream from a different camera is selected based on an audio stream associated with the real-time stream and the second real-time stream.

In some embodiments, the second real-time stream is selected in response to presence of a bar, an amplitude intensity, and/or a singing phrase in the audio stream.

In some aspects, the described systems and methods provide for a computer implemented method for processing a stream for real-time transmission, the method comprising the acts of receiving a real-time stream for an artistic performance, detecting one or more human persons in the real-time stream, ranking the detected one or more human persons in the real-time stream, selecting, based on the ranking, a subject from the detected one or more human persons, determining a subject framing for the real-time stream based on the selected subject, processing the real-time stream to select a portion of each frame in the real-time stream according to the subject framing, wherein the portion of each frame includes at least the subject, and transmitting the processed stream in real-time.

It should be appreciated that all combinations of the foregoing concepts and additional concepts discussed in greater detail below (provided such concepts are not mutually inconsistent) are contemplated as being part of the inventive subject matter disclosed herein. In particular, all combinations of claimed subject matter appearing at the end of this disclosure are contemplated as being part of the inventive subject matter disclosed herein.

BRIEF DESCRIPTION OF DRAWINGS

Various non-limiting embodiments of the technology will be described with reference to the following figures. It should be appreciated that the figures are not necessarily drawn to scale.

FIG. 1 is a block diagram of a real-time transmission production system including an exemplary autodirection system in accordance with some embodiments of the technology described herein;

FIG. 2 is a block diagram of an autodirection system in accordance with some embodiments of the technology described herein;

FIG. 3 is a diagram of an exemplary process for producing a real-time transmission of an event in accordance with some embodiments of the technology described herein;

FIGS. 4-12 show exemplary embodiments for processing a stream for real-time transmission in accordance with some embodiments of the technology described herein;

FIG. 13 is a diagram of another exemplary process for producing a real-time transmission of an event in accordance with some embodiments of the technology described herein;

FIG. 14 shows exemplary interfaces for a software application for consuming real-time transmission of one or more events;

FIG. 15 shows an example implementation for an autodirection system in accordance with some embodiments of the technology described herein; and

FIG. 16 shows an example computer system for executing an autodirection system in accordance with some embodiments of the technology described herein.

DETAILED DESCRIPTION

As discussed above, conventional approaches to producing a live stream are cost prohibitive for many live events, such as local music concerts and comedy shows, that an end user may be interested in viewing. The inventors have recognized an autodirection system for producing a real-time transmission of an event that is advantageous over conventional approaches. In some embodiments, the described autodirection system does not require expensive camera equipment or a production crew to produce the real-time transmission.

In particular, the described systems and methods provide for, among other things, receiving one or more streams from one or more microphones and/or one or more cameras, synchronizing the received one or more streams to a current time, scoring each received stream with respect to one or more metrics, and based on the scored metrics, selecting a stream from the one or more streams for real time transmission.

The described systems and methods improve computerized real-time transmission technology by enabling automated capture of live video and/or audio feeds of a live event, direction of the live video and/or audio feeds in real time based on one or more metrics, and generation of a real-time transmission of the event for an end user. The direction and/or editing of the feeds of the live event is performed in real time and the end user receives the real-time transmission as soon as it is generated.

The described systems and methods provide a particular solution to the problem of providing a scalable, turn-key mechanism to automatically capture and direct a stream of a live event. The described systems and methods provide a particular way for automated generation of a real-time transmission by, among other things, receiving one or more streams from one or more microphones and/or one or more cameras, synchronizing the received one or more streams to a current time, scoring each received stream with respect to one or more metrics, and based on the scored metrics, selecting a stream from the one or more streams for real time transmission.

The described systems and methods may be used for several different purposes including, but not limited to, generating real-time transmissions of local, regional, national and international live events, musical concerts, comedy shows, theater plays, sports events, and other suitable live events. The end user may receive a personalized stream based on the user's preferences or a standard stream suitable for all end users. The end user may access a centralized location to stream available live events instead of searching through events scattered over multiple sources and difficult to find.

FIG. 1 shows a block diagram 100 of a real-time transmission production system including an autodirection system in accordance with some embodiments of the technology described herein. In this illustrative embodiment, house microphone 102 is placed near the stage for a musical concert and one or more cameras 104 are positioned around the venue for the musical concert. House microphone 102 may include a microphone placed near the stage, one or more microphones placed near different performers, a soundboard connected to one or more microphones, and/or other suitable equipment for capturing audio from the event. Cameras 104 may be permanently or temporarily installed at the venue for the event. Additionally, cameras 104 may be stationary or moving, e.g., on rails, as suitable for capturing the video of the event. Optionally, one or more of cameras 104 may allow for control of panning and/or zooming functions of the physical camera, as suitable for capturing the video of the event. Though this illustrative embodiment is described with respect to a musical concert, the described systems and methods are equally applicable to any event suitable for real-time transmission to an end user, such as local, regional, national and international live events, comedy shows, theater plays, sports events, and other suitable live events.

One or more audio feeds from the house microphone 102 and one or more video feeds from cameras 104 may be sent to multiplexing hardware 106. The multiplexing hardware 106 may receive the audio and video feeds and convert them to a stream format appropriate for autodirection system 108. Optionally, the multiplexing hardware 106 may synchronize the audio and video feeds to a current time. Optionally, the multiplexing hardware 106 may convert one or more feeds from an analog format to a digital format or vice versa. In an example, the multiplexing hardware 106 includes hardware, such as the MUXLAB 500471-SA HDMI INPUT CARD manufactured by MUXLAB INC. of Quebec, Canada, to perform the described functionality. Autodirection system 108 may synchronize the video and audio streams, decide which streams to select, and generate a real-time transmission of the event. Streaming/upload hardware 110 may send the real-time transmission via the Internet 112 to a cloud service including a video transcoding system 114 for transcoding the real-time transmission and/or a content distribution network (CDN) 116 and/or application programming interface (API) 118 for delivery of the real-time transmission to end user 120. For example, the streaming/upload hardware 110 may include one or more servers for uploading the real-time transmission to the cloud service. The video transcoding system 114 may include one or more servers for transcoding the real-time transmission into one or more formats suitable for end user 120. The CDN 116 may include one or more servers for delivering the real-time transmission, transcoded or otherwise, to the end user 120. The API 118 may include functionality for a web-based service, such as a website offering a centralized location for accessing real-time transmissions, to stream the real-time transmission to end user 120.

FIG. 2 shows a block diagram 200 of an autodirection system in accordance with some embodiments of the technology described herein. Autodirection system 202 may receive audio and/or video streams from multiplexing hardware 106. Autodirection system 202 includes a component 206 for stream synchronization. Component 206 may receive audio streams input from a house microphone, a soundboard, and/or one or more microphones from different performers. Component 206 may also receive video streams from independent cameras and/or cameras plugged into the same system. Cameras plugged into the same system may not need synchronization among each other because there is no transmission lag. Independent cameras may be a few frames out of synchronization due to transmission lag. Cameras can be moving or stationary, permanently or temporarily installed. The audio and video streams may be synchronized using time stamps on each feed. Additionally or alternatively, the audio and video streams may be synchronized by comparing audio waveforms in each stream and synchronizing the streams. In some embodiments, the multiplexing hardware 106 is a part of the autodirection system to allow access to multiple audio and video streams at the same time. In some cases, signals from user cameras, e.g., smartphone cameras, are added to the video streams. The audio signals from the user cameras may be used to synchronize the video signal from the user cameras with the video streams from cameras 104.
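
By way of illustration only, the following sketch shows one way the waveform-based synchronization described above could be approximated, assuming the audio tracks of two feeds are available as NumPy arrays; the function names and the cross-correlation approach are illustrative assumptions rather than the disclosed implementation.

```python
# Illustrative sketch only (not the disclosed implementation): estimate the
# offset between two feeds by cross-correlating their audio waveforms, one
# possible realization of the waveform-based synchronization described above.
import numpy as np

def estimate_lead_samples(reference: np.ndarray, candidate: np.ndarray) -> int:
    """Return how many samples the candidate feed leads the reference feed."""
    # Normalize so loudness differences between microphones do not dominate.
    ref = (reference - reference.mean()) / (reference.std() + 1e-9)
    cand = (candidate - candidate.mean()) / (candidate.std() + 1e-9)
    correlation = np.correlate(ref, cand, mode="full")
    # The peak location gives the best alignment between the two waveforms.
    return int(correlation.argmax() - (len(cand) - 1))

def align(candidate: np.ndarray, lead_samples: int) -> np.ndarray:
    """Shift the candidate feed so it lines up with the reference feed."""
    if lead_samples > 0:
        return candidate[lead_samples:]  # candidate started earlier: drop leading samples
    return np.concatenate([np.zeros(-lead_samples), candidate])  # pad silence at the front
```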

Audio/video observer(s) 208 may receive the streams and analyze them based on audio and video metrics. The audio metrics may include voice detection, instrument detection, onset detection, intensity, larm, loudness, beat tracking, danceability, and/or other suitable audio metrics. Executing the instrument detection metric may include using machine learning to match an instrument in one or more video streams to the expected shape. Such an analysis may optionally include one or more audio streams of the event to match the instrument based on the type of sounds produced by the instrument. For example, one or more audio streams may be analyzed to determine that the instrument in question is a guitar based on the sounds from the instrument. Additionally or alternatively, a piece of tape that is detectable by one or more of cameras 104 may be attached to an instrument and tracked to follow the movement of the instrument on stage. Other metrics may include laughter (e.g., for a comedy event), applause (e.g., for a sports event), and other suitable metrics, e.g., those described at http://essentia.upf.edu/documentation/algorithms_reference.html, the entirety of which is incorporated herein by reference.
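
As a hedged illustration of how two of the simpler audio metrics named above (loudness and onset detection) might be scored, the following sketch uses plain NumPy; a production observer may instead rely on a dedicated library such as Essentia, and the windowing choices here are arbitrary assumptions.

```python
# Illustrative only: rough loudness and onset-strength scores for one analysis
# window of audio, stand-ins for the richer metrics an observer may compute.
import numpy as np

def loudness_score(window: np.ndarray) -> float:
    """Root-mean-square level of the window, a simple loudness proxy."""
    return float(np.sqrt(np.mean(np.square(window))))

def onset_strength(prev_window: np.ndarray, curr_window: np.ndarray) -> float:
    """Spectral flux between consecutive windows; large values suggest a sound onset."""
    prev_mag = np.abs(np.fft.rfft(prev_window))
    curr_mag = np.abs(np.fft.rfft(curr_window))
    # Count only increases in per-bin energy (half-wave rectification).
    return float(np.sum(np.maximum(curr_mag - prev_mag, 0.0)))
```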

The video metrics may include motion tracking, vocalist identification, instrumentalist identification, and/or other suitable video metrics. Observers 208 may receive spatial and/or motion tracking information from sensors at the venue. For motion tracking, a piece of tape that is detectable by one or more of cameras 104 may be attached to an artist and tracked to follow the movement of the artist on stage. Observers 208 may use machine learning to distinguish between instruments in the video streams. In some embodiments, observers 208 may use a machine learning model to match one or more instruments present in a video stream and/or an audio stream to their expected types of shape and/or their expected types of sound. For example, observers 208 may analyze a video stream and/or an audio stream to determine that the instrument in question has a type of shape and/or a type of sound associated with a guitar instrument. Observers 208 may use a machine learning model trained on data representative of types of shape and/or types of sound and their associated instruments. The machine learning model may receive as input processed versions of the audio stream and/or the video stream. Optionally, the audio stream may be processed to isolate aural characteristics of the particular instrumental sound being analyzed. Further, optionally, the video stream may be processed to isolate visual characteristics of the particular instrumental shape being analyzed. The aural and/or visual characteristics obtained from processing the audio stream and/or the video stream may be applied as inputs to the machine learning model in order to determine the corresponding instrument under analysis.

In some embodiments, an observer-based architecture may be employed for analyzing the incoming media streams. This type of architecture can be summarized using an analogy from sports broadcasting. In a live sports broadcasting event, there are camera people, director's assistants, and the director. The camera people are responsible for capturing the important elements of a sporting event, as well as choice shots of the crowd when something interesting is happening. The director's assistants watch over all the incoming camera feeds and flag what, from their perspective, is the most important thing to focus on. The director then makes the final decision on what shot to cut to and when. In the observer-based architecture being used, the inputs (e.g., the video and audio feeds) are akin to the camera people, the observer processes are akin to the director's assistants, and the decision making engine is akin to the director.

This architecture may allow for delegating the analysis of each audio/video media stream to a different observer process on the autodirection system 202. As each observer process analyzes its media stream, the process may continuously output a score for the metrics that the process is analyzing the stream for. This score may then be used by the decision making engine 210 to make edits to the media stream.

In some embodiments, each media stream (audio or video) is fed into an observer process that analyzes all the activated metrics for the media stream simultaneously. For example, in the case with three video streams and one audio stream, four observer processes may be required. Each observer process may analyze all metrics for each media stream. An advantage to this approach may be that all metrics can be scored and sent to the decision making engine 210 at the same time. However, because analysis of all metrics is happening inside a single observer process, a disadvantage to this approach may be that it takes longer to relay the observations to the decision engine. This may lead to inaccurate decisions as the decision making engine 210 may be forced to act before it receives information from all observers.

In some embodiments, each media stream is passed to multiple observers, with each observer being responsible for a single analysis metric. For example, in the case with three video streams and one audio stream, and five metrics, 20 observer processes may be required to assign one observer per media stream/metric. An advantage to this approach may be that analysis of the metrics may be completed in a shortened amount of time. A disadvantage to this approach may be that the decision making engine 210 may have to do additional work handling more, smaller chunks of score data from more observer processes.

In some embodiments, each media stream is passed to multiple observers, with each observer being responsible for a group of metrics. Each observer may be responsible for a group of metrics that are dependent on each other (e.g., instrument detection that requires audio and video analysis). This approach may be referred to as the compound metrics approach. Each observer process may analyze one media stream at a time, but each observer process may analyze a subset of all metrics (e.g., two metrics).

In some embodiments, each observer process may output scores for the metric(s) being observed to decision making engine 210. In some embodiments, some or all observer processes include their own decision engines to decide the best scored metric and only send those on to the decision making engine 210. Depending on the type of media stream being scored and the metric being analyzed, the scores returned by observer processes may be used by the decision making engine 210 to determine a variety of things. The actual scoring values may depend on the metric being analyzed and the underlying analysis tools used to generate them. It is then the decision making engine's responsibility to interpret them and act accordingly.
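
The following sketch illustrates, under stated assumptions, how the observer/decision split described above might be wired together with one observer process per stream and metric; the capture stub, the stand-in metric, and the decision rule are placeholders introduced only for this example.

```python
# Minimal sketch of the observer-based architecture: observer processes push
# (stream, metric, score) tuples onto a queue, and the decision loop reads them
# asynchronously and picks the highest-scoring stream once enough scores arrive.
import multiprocessing as mp
import random
import time

def capture_score(stream_id: str) -> float:
    """Stand-in for capturing a frame and scoring it against one metric."""
    return random.random()

def observer(stream_id: str, metric_name: str, score_queue) -> None:
    """One observer process: continuously score one stream for one metric."""
    while True:
        score_queue.put((stream_id, metric_name, capture_score(stream_id)))
        time.sleep(0.1)

def decision_loop(score_queue, minimum_scores: int = 6) -> None:
    """Wait for a threshold number of scores, then pick the best stream."""
    pending = []
    while True:
        pending.append(score_queue.get())
        if len(pending) >= minimum_scores:
            totals = {}
            for stream_id, _, score in pending:
                totals[stream_id] = totals.get(stream_id, 0.0) + score
            print("switch to", max(totals, key=totals.get))  # stand-in for the renderer
            pending.clear()

if __name__ == "__main__":
    queue = mp.Queue()
    for camera in ("left", "center", "right"):
        mp.Process(target=observer, args=(camera, "motion", queue), daemon=True).start()
    decision_loop(queue)
```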

Decision making engine 210 may receive scored metrics as input from observers 208 and decide on which video stream (and/or audio stream) to use. Decision making engine 210 and/or observers 208 may decide when/how often cuts between video streams should occur. For example, decision making engine 210 may decide when the cuts between the video streams should occur, while observers 208 may decide how often the cuts between the video streams should occur. Decision making engine 210 and/or observers 208 may decide when panning and/or zooming should occur. In some embodiments, panning may be accomplished by using a pan-and-scan technique. For example, a high-resolution video stream may be received and a portion of the stream may be selected using the pan-and-scan technique. The portion may be selected by decision making engine 210 and/or observers 208 based on the scored metrics received as input from observers 208. Decision making engine 210 may receive scored metrics asynchronously. In such cases, a threshold number of scored metrics may be required to be received before a decision is made. Decision making engine 210 may use one or more machine learning algorithms or a rule-based system or a combination thereof. The rule-based system may be defined based on event type (and/or venue, etc.). In the case of pre-defined rules, the machine learning system may override the pre-defined rules if there is a conflict or a vote may be taken between the two. In embodiments with a three camera system, camera 1 is shown until camera 2 or camera 3 is found to be more interesting. Decision making engine 210 performs analysis in real time. Therefore, a few frames may be missed before switching video streams, e.g., when a “drum solo” comes on. Decision making engine 210 and/or observers 208 may decide whether to zoom into a given video stream. Decision making engine 210 and/or observers 208 may use a machine learning algorithm to artistically frame the subject. Rendering engine 212 generates real-time transmission 204 for the end user. Once decisions have been made, they are passed to the rendering engine 212, where the chosen video stream and corresponding crops/zooms may be implemented. The edited video may then be paired with the audio stream and rendered down to the final real-time transmission. This final stream may be uploaded to a cloud-based transcoding system, passed to hardware responsible for uploading to the chosen transcoding solution, or sent to another suitable destination.
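
By way of a non-limiting sketch, the pan-and-scan selection mentioned above could be approximated as follows, assuming each frame is a NumPy image array and a subject position and zoom level have already been chosen; the function and parameter names are illustrative assumptions.

```python
# Illustrative pan-and-scan crop: keep a window of the requested zoom level
# centered on the subject, clamped so the window stays inside the frame.
import numpy as np

def pan_and_scan(frame: np.ndarray, subject_xy: tuple, zoom: float) -> np.ndarray:
    """zoom=0.5 keeps half the original width and height around the subject."""
    height, width = frame.shape[:2]
    crop_h, crop_w = max(int(height * zoom), 1), max(int(width * zoom), 1)
    cx, cy = subject_xy
    # Center the crop on the subject, then clamp it to the frame bounds.
    left = min(max(int(cx - crop_w / 2), 0), width - crop_w)
    top = min(max(int(cy - crop_h / 2), 0), height - crop_h)
    return frame[top:top + crop_h, left:left + crop_w]
```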

In some embodiments, zoom-related decisions for the cameras may be made at decision making engine 210. One reason to decide zoom levels at decision making engine 210 is that a preset zoom at the camera may give less information in the video stream. This can affect observer scoring negatively. In some embodiments, decision making engine 210 may have a preset zoom level of 50%. This may be helpful in cases where only panning is sufficient to initially generate the real-time transmission. Decision making engine 210 may subsequently alter the zoom level as the real-time transmission is further generated. In some embodiments, video streams from different kinds of cameras (e.g., pan-tilt-zoom, wide angle, etc.) may be combined in editing.

In some embodiments, decision making engine 210 incorporates a machine learning model that can detect desirability. For example, the model may be trained on recorded shows with live crews to determine what kinds of edits to make to the video streams.

In some embodiments, decision making engine 210 generates more than one final video stream based on user preferences. Decision making engine 210 may use cameras with wide fields of view and zoom in on different portions of the video depending on the user. For example, for cameras with a wide field of view, decision making engine 210 may have access to video streams including multiple subjects. Decision making engine 210 may personalize a video stream for a user who is a drummer fan to focus on the drummer, while the engine can personalize another video stream for a user who is a guitar fan to focus on the guitar player, etc.

FIG. 3 shows a diagram of an exemplary process 300 for producing a real-time transmission of an event in accordance with some embodiments of the technology described herein. In particular, at block 302, process 300 begins. At block 304, the autodirection system receives one or more streams from one or more microphones and/or one or more cameras. For example, the autodirection system may receive audio and/or video streams from multiplexing hardware. At block 306, the autodirection system synchronizes the received one or more streams to a current time. For example, the autodirection system may receive audio streams input from a house microphone, a soundboard, and/or one or more microphones from different performers. The autodirection system may also receive video streams from independent cameras and/or cameras plugged into the same system. The audio and video streams may be synchronized using time stamps on each feed. Additionally or alternatively, the audio and video streams may be synchronized by comparing audio waveforms in each stream and synchronizing the streams. At block 308, the autodirection system scores each received stream with respect to one or more metrics. For example, the autodirection system may receive the streams and analyze them based on audio and video metrics. The audio metrics may include voice detection, instrument detection, onset detection, intensity, larm, loudness, beat tracking, danceability, and/or other suitable audio metrics. The video metrics may include motion tracking, vocalist identification, instrumentalist identification, and/or other suitable video metrics. At block 310, the autodirection system selects, based on the scored metrics, a stream from the one or more streams for real time transmission. For example, the autodirection system may receive scored metrics as input from one or more observer processes and decide, in real time, on which video stream (and/or audio stream) to use. Optionally, the autodirection system may decide when/how often cuts between video streams should occur and when panning and/or zooming should occur. The autodirection system may use one or more machine learning algorithms or a rule-based system or a combination thereof to make these decisions. At block 312, process 300 ends.

FIGS. 4-12 show exemplary embodiments for processing a stream for real-time transmission in accordance with some embodiments of the technology described herein. In some embodiments, a left camera, a right camera, and/or a center camera are permanently or temporarily installed at a venue of a live event. The center camera may also function as a “ghost” camera that provides the live video and/or audio feeds from the center camera without any zoom. In some embodiments, the ghost camera provides a wide establishing shot that slowly zooms in and out. Additionally or alternatively, a separate ghost camera may be permanently or temporarily installed at the venue of the live event.

In some embodiments, the autodirection system may include a processor in communication with memory. The processor may be configured to execute instructions for, among other things, an autodirection component stored in memory. The processor may detect one or more bodies of performers at the live event to include in the real-time transmission. For example, the processor may detect one or more bodies using OPENPOSE, TENSORRT, or another suitable algorithm(s), which may include a real-time multi-person keypoint detection library for, among other features, body, face, hands, and foot estimation. In some embodiments, the processor may determine a skeleton-based frame that includes one or more mandatory body parts before concluding that a body has been detected. OPENPOSE represents the first real-time multi-person system to jointly detect human body, hand, facial, and foot keypoints (in total 135 keypoints) on single images, and is accessible from the website, github.com/CMU-Perceptual-Computing-Lab/openpose, the entirety of which is incorporated by reference herein.
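
As a hedged sketch of the mandatory-body-part check described above, the following assumes a pose estimator (such as an OpenPose-style detector) returns named keypoints with confidences; the part names, dictionary layout, and confidence threshold are assumptions for illustration only.

```python
# Illustrative check: a pose detection only counts as a person if the mandatory
# parts are present with sufficient confidence. Layout and threshold are assumed.
MANDATORY_PARTS = ("eye", "elbow", "shoulder")

def is_valid_body(keypoints: dict, min_confidence: float = 0.3) -> bool:
    """keypoints maps a part name to an (x, y, confidence) tuple."""
    for part in MANDATORY_PARTS:
        point = keypoints.get(part)
        if point is None or point[2] < min_confidence:
            return False
    return True

# Example: a detection missing an elbow is rejected.
detection = {"eye": (120, 80, 0.9), "shoulder": (130, 160, 0.8)}
assert is_valid_body(detection) is False
```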

In some embodiments, if no bodies are detected at the live event, the processor may use one or more backup rules to estimate where the cameras should point during the live event. For example, a band may like to perform with the lights off in order to provide an atmosphere appropriate for their performance. However, it may be difficult to detect any bodies of the performers in such a situation. The processor may use a backup rule to point the cameras to the center of the venue in such a situation. Additionally or alternatively, the processor may use a backup rule to sequentially cycle through different positions at the venue in order to capture the atmosphere of the performance. Additionally or alternatively, the processor may use a backup rule for “NO-S” subject framing where no subject is found, e.g., because it is too dark. This subject framing may be wide enough to include all or a large portion of the venue stage, to ensure that no interesting action is missed. It is noted that the subject framing may include a selection of one or more subjects, or in this case, a determination that no subject was found.

In some embodiments, camera switching between camera feeds may continue as normal in cases where no subject is found, as long as it is determined that a performance is being conducted based on a band fingerprint (or heat mapping). A band fingerprint may include position, movement, and/or topology of, e.g., the band members. For example, the band fingerprint may include information, stored in a cache, on where the band members were standing before it was too dark. The cache may be refreshed periodically. The cache may be checked to determine whether people are setting up the next performance, whether the performers have changed to a different band, and/or whether a band is playing or not. In some embodiments, the band fingerprint may be used to black out a band in case they would like to opt out of the real-time transmission of the performances at the venue.

In some embodiments, there may be no subjects found before the performance has started and/or between band setups. In these cases, the cameras may not switch as fast as when there is a band playing, so the camera switching may be slowed down. For example, if no band is playing, then the camera switching may happen differently, e.g., every 30 seconds, compared to every 10 seconds when a band is playing. Additionally or alternatively, the camera switching may not happen at all, and instead the ghost camera output feed may be shown until a band is detected.

In some embodiments, the processor may detect where the singer is present in the performance, in addition to detecting one or more bodies of the performers. For example, the processor may determine the performer in the center of the stage to be the singer. In another example, the processor may determine the performer having another performer on either side to be the singer. In some embodiments, data relating to the current feed is provided as input to a recurrent neural network trained to detect a singer. The recurrent neural network may be trained on data where faces of performers are tagged as singing or not singing. For example, the recurrent neural network may be more than 90% accurate, or another suitable threshold, in identifying when a performer is not singing. In some embodiments, the singer may be the highest priority or ranked subject and, once detected, may automatically be selected as the subject of the frame at the next interval when the subject framing and/or the camera feed are selected. In some embodiments, if a singer is detected and the current subject is not that singer, a penalty may be issued (described further below with respect to FIG. 5) to hasten the next selection of the subject framing and/or camera feed. In some embodiments, data relating to the current feed is provided as input to a recurrent neural network trained to detect a guitar solo or another suitable interesting portion of the performance. The recurrent neural network may be trained on data where performers are tagged as performing a guitar solo or not.
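
Purely as an illustrative sketch of the kind of recurrent classifier described above (and not the disclosed model), a minimal PyTorch module could label a short sequence of per-frame face features as singing or not singing; the feature extraction step, layer sizes, and window length are assumptions.

```python
# Illustrative only: a tiny recurrent "singing / not singing" classifier over a
# window of per-frame face features. Feature extraction and sizes are assumed.
import torch
import torch.nn as nn

class SingingDetector(nn.Module):
    def __init__(self, feature_dim: int = 128, hidden_dim: int = 64):
        super().__init__()
        self.rnn = nn.GRU(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, face_features: torch.Tensor) -> torch.Tensor:
        # face_features: (batch, frames, feature_dim) for one tracked performer.
        _, last_hidden = self.rnn(face_features)
        # Probability that the performer is singing over this window.
        return torch.sigmoid(self.head(last_hidden[-1]))

model = SingingDetector()
window = torch.randn(1, 30, 128)    # one performer, 30 frames of assumed features
probability = float(model(window))  # near 0.5 for an untrained model
```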

FIG. 4 shows an exemplary embodiment 400 for processing a stream for real-time transmission in accordance with some embodiments of the technology described herein. In FIG. 4, a frame from the center camera's video feed is shown. Z1 and Z2 represent the possible zoom levels available when processing this frame, though the options may not be so limited. In some embodiments, the processor may constantly zoom between levels Z1 and Z2 to give an impression of motion to the viewer. This frame shows the full view without focusing on any of the subjects. This configuration may be represented as Camera (C): Center, Subject Framing (F): Full, Zoom (Z): Z1 or Z2; and Penalty (Q): 0. In some embodiments, the processor may determine one or more subjects for the frame, represented as S+n, where n indicates an additional number of subjects. More details regarding the Penalty (Q) are provided further below.

FIG. 5 shows an exemplary embodiment 500 for processing a stream for real-time transmission in accordance with some embodiments of the technology described herein. In FIG. 5, a frame from the center camera's video feed is shown. Z1 and Z2 represent the possible zoom levels available when processing this frame, though the options may not be so limited. In some embodiments, the processor may constantly zoom between levels Z1 and Z2 to give an impression of motion to the viewer. This frame shows a subject determined by the processor and annotated with an indicator (e.g., annotated with a dot). This configuration may be represented as C: Center, F: S+0, Z: Z1 or Z2; and Penalty (Q): 0.

In some embodiments, the center camera with the Full (F) subject framing described above with respect to FIG. 4 may be equivalent to a cameraman with a camera mounted on a tripod at the back of the venue, providing a dramatic overview including the look and feel of the performers. In some embodiments, using the same center camera, but with the S+0 subject framing (or another suitable subject framing), may be equivalent to a cameraman with a handheld camera positioned in the middle of the venue, providing action shots of the performance. Similarly, using the left or right camera, with the S+0 subject framing (or another suitable subject framing), may be equivalent to a cameraman positioned at the left or right of the venue, respectively.

In some embodiments, the processor may decide whether to use S+0, S+1, S+2, or another suitable S+n subject framing, using two sets of probabilities: one for when there is someone singing, and another for when there is not. When there is someone singing, the processor may select tighter shots like S+0 to get close-up shots of the singer, and when there is no singer, the processor may raise the probability of S+1 and/or S+2 to simultaneously get a more varied look as well as avoid focusing on someone who is not important to the performance.
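
The two probability sets mentioned above might look like the following sketch; the specific weights are not disclosed, and the values below are arbitrary placeholders chosen only to illustrate the mechanism.

```python
# Illustrative framing selection: one weight table for when someone is singing
# (favor close-ups) and one for when no one is (favor wider, varied shots).
# The weights themselves are placeholder assumptions.
import random

FRAMING_WEIGHTS = {
    True:  {"S+0": 0.7, "S+1": 0.2, "S+2": 0.1},   # someone is singing
    False: {"S+0": 0.2, "S+1": 0.4, "S+2": 0.4},   # no one is singing
}

def choose_framing(someone_singing: bool) -> str:
    weights = FRAMING_WEIGHTS[someone_singing]
    return random.choices(list(weights), weights=list(weights.values()))[0]
```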

In some embodiments, in order to determine one subject for the frame, the processor may determine multiple subjects, rank the detected subjects, and select the highest ranked subject. The ranking may be determined based on one or more factors including, but not limited to, whether a subject is the singer, a subject's proximity to a center of the stage or venue, whether a subject is in motion, and how long a given subject has been tracked as a subject. For example, the processor may determine the head of the subject using OPENPOSE and select the top, bottom, center, or another suitable portion of the head to indicate the presence of the subject. The processor may determine margins for the left, center, top, and/or bottom with respect to the indicator for the subject. The processor may use the margins to ensure that the subject is correctly positioned and visible in the frame. In some embodiments, the processor may determine the size of the head of the subject in order to determine a distance of the subject from the camera. In some embodiments, margins of the subject framing may directly correspond to the head size. As a subject comes closer to the camera, the processor may zoom out and keep the subject in frame because their head size appears to grow. Additionally or alternatively, a performer, e.g., a drummer, who is far away from the camera may not be considered as a subject and may be disqualified based on head size, e.g., based on a threshold head size for subject selection.
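
A hedged sketch of the ranking step follows, scoring each detected person from the factors listed above; the particular weights, the field names, and the head-size disqualification threshold are illustrative assumptions rather than the disclosed method.

```python
# Illustrative subject ranking: combine the factors named above into a score
# and sort detections by it. Weights, field names, and the head-size cutoff
# are placeholder assumptions.
def rank_subjects(people: list, frame_width: int, min_head_size: float = 20.0) -> list:
    """Each person is a dict with 'is_singer', 'center_x', 'speed',
    'tracked_frames', and 'head_size' entries."""
    def score(person: dict) -> float:
        if person["head_size"] < min_head_size:
            return float("-inf")   # e.g., a distant drummer is disqualified
        centrality = 1.0 - abs(person["center_x"] - frame_width / 2) / (frame_width / 2)
        return (5.0 * person["is_singer"]
                + 2.0 * centrality
                + 1.0 * person["speed"]
                + 0.1 * person["tracked_frames"])
    return sorted(people, key=score, reverse=True)
```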

In some embodiments, the processor may issue a penalty for frames where the subject is partially or wholly out of the frame, tracking of the subject has failed, noise is present in movement detection of the subject, and/or another suitable situation where the frame does not look acceptable for transmission to the viewer. The penalty may be used as a threshold to override the current camera feed and cut away to the video feed from one of the other cameras. For example, if another camera feed is typically selected every six to eight seconds, the penalty may force the other camera feed to be selected sooner than this threshold period.
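
The following sketch illustrates one possible penalty mechanism consistent with the description above, with a running penalty that can force an early cut; all constants are placeholder assumptions.

```python
# Illustrative penalty logic: bad-looking frames accumulate penalty, and crossing
# a threshold forces a cut before the normal switch window elapses. Constants
# are placeholder assumptions.
def update_penalty(penalty: float, subject_lost: bool,
                   subject_out_of_frame: bool, noisy_detection: bool) -> float:
    if subject_lost:
        penalty += 3.0
    if subject_out_of_frame:
        penalty += 2.0
    if noisy_detection:
        penalty += 1.0
    return penalty

def should_cut(seconds_on_shot: float, penalty: float,
               normal_window: float = 6.0, penalty_threshold: float = 5.0) -> bool:
    """Cut early when the penalty threshold is crossed; otherwise wait for the window."""
    return penalty >= penalty_threshold or seconds_on_shot >= normal_window
```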

FIG. 6 shows an exemplary embodiment 600 for processing a stream for real-time transmission in accordance with some embodiments of the technology described herein. In FIG. 6, a frame from the center camera's video feed is shown. Z1 and Z2 represent the possible zoom levels available when processing this frame, though the options may not be so limited. In some embodiments, the processor may constantly zoom between levels Z1 and Z2 to give an impression of motion to the viewer. This frame shows two subjects determined by the processor and annotated with indicators (e.g., annotated with dots). This configuration may be represented as C: Center, F: S+1, Z: Z1 or Z2; and Penalty (Q): 0. In some embodiments, in order to determine two subjects for the frame, the processor may determine multiple subjects, rank the detected subjects, and select the two highest ranked subjects.

FIG. 7 shows an exemplary embodiment 700 for processing a stream for real-time transmission in accordance with some embodiments of the technology described herein. In FIG. 7, a frame from the left camera's video feed is shown. Z1 and Z2 represent the possible zoom levels available when processing this frame, though the options may not be so limited. In some embodiments, the processor may constantly zoom between levels Z1 and Z2 to give an impression of motion to the viewer. This frame shows a subject determined by the processor and annotated with an indicator (e.g., annotated with a dot). This configuration may be represented as C: Left, F: S+0, Z: Z1 or Z2; and Penalty (Q): 0. In some embodiments, in order to determine one subject for the frame, the processor may determine multiple subjects, rank the detected subjects, and select the highest ranked subject. In some embodiments, the processor may switch to the left camera's video feed based on sound onset detection. The processor may receive the audio feed associated with the performance and determine one or more events that indicate whether it is appropriate to switch to a different camera's feed. For example, an event may include changes in a bar measure. The bar may represent a segment of time corresponding to a specific number of beats in which each beat is represented by a particular note value. In another example, an event may include a change in amplitude intensity, e.g., the beginning or end of another section of the performance. In yet another example, one or more singing phrases may signal the beginning or end of another section of the performance.
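
As an illustrative sketch of aligning a camera cut to such an audio event, the selection could pick the earliest detected event inside the allowed switch window and fall back to the end of the window otherwise; the window bounds and the source of the event times are assumptions here.

```python
# Illustrative only: choose a cut time aligned to an audio event (e.g., a bar
# change or onset) inside the allowed six-to-eight second switch window.
def choose_switch_time(now: float, event_times: list,
                       min_gap: float = 6.0, max_gap: float = 8.0) -> float:
    window_start, window_end = now + min_gap, now + max_gap
    candidates = [t for t in event_times if window_start <= t <= window_end]
    # Cut on the first event inside the window, or at the window's end otherwise.
    return min(candidates) if candidates else window_end
```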

FIG. 8 shows an exemplary embodiment 800 for processing a stream for real-time transmission in accordance with some embodiments of the technology described herein. In FIG. 8, a frame from the left camera's video feed is shown. Z1 and Z2 represent the possible zoom levels available when processing this frame, though the options may not be so limited. In some embodiments, the processor may constantly zoom between levels Z1 and Z2 to give an impression of motion to the viewer. This frame shows two subjects determined by the processor and annotated with indicators (e.g., annotated with dots). This configuration may be represented as C: Left, F: S+1, Z: Z1 or Z2; and Penalty (Q): 0. In some embodiments, in order to determine two subjects for the frame, the processor may determine multiple subjects, rank the detected subjects, and select the two highest ranked subjects. In some embodiments, the camera feeds may be switched within a threshold period, e.g., every six to eight seconds, and an appropriate switching time within the threshold period may be selected based on one of the events described above.

FIG. 9 shows an exemplary embodiment 900 for processing a stream for real-time transmission in accordance with some embodiments of the technology described herein. In FIG. 9, a frame from the left camera's video feed is shown. Z1 and Z2 represent the possible zoom levels available when processing this frame, though the options may not be so limited. In some embodiments, the processor may constantly zoom between levels Z1 and Z2 to give an impression of motion to the viewer. This frame shows three subjects determined by the processor and annotated with indicators (e.g., annotated with dots). This configuration may be represented as C: Left, F: S+2, Z: Z1 or Z2; and Penalty (Q): 0. In some embodiments, in order to determine three subjects for the frame, the processor may determine multiple subjects, rank the detected subjects, and select the three highest ranked subjects. In some embodiments, in order to maintain a resolution quality of the feed, the processor may enforce a minimum frame size. For example, with a 4K resolution video feed, in order to maintain a 720p resolution stream, the processor may at most zoom into the frame such that the minimum height is a third of the original frame.
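
The resolution constraint above can be checked with a line of arithmetic, assuming a 2160-pixel-tall 4K source and a 720-pixel-tall output; the helper below is illustrative only.

```python
# Quick check of the zoom limit above: a 2160-pixel-tall 4K frame can be cropped
# to at most one third of its height while still covering 720 output pixels
# without upscaling.
SOURCE_HEIGHT = 2160   # 4K UHD frame height (assumed)
TARGET_HEIGHT = 720    # desired output height

assert abs(TARGET_HEIGHT / SOURCE_HEIGHT - 1 / 3) < 1e-12  # minimum crop height is one third

def clamp_crop_height(requested_height: int) -> int:
    """Never crop tighter than the minimum height that preserves 720p."""
    return max(requested_height, TARGET_HEIGHT)
```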

FIG. 10 shows an exemplary embodiment 1000 for processing a stream for real-time transmission in accordance with some embodiments of the technology described herein. In FIG. 10, a frame from the right camera's video feed is shown. Z1 and Z2 represent the possible zoom levels available when processing this frame, though the options may not be so limited. In some embodiments, the processor may constantly zoom between levels Z1 and Z2 to give an impression of motion to the viewer. This frame shows a subject determined by the processor and annotated with an indicator (e.g., annotated with a dot). This configuration may be represented as C: Right, F: S+0, Z: Z1 or Z2; and Penalty (Q): 0. In some embodiments, in order to determine the subject for the frame, the processor may determine multiple subjects, rank the detected subjects, and select the highest ranked subject. In some embodiments, the camera feeds may be switched within a threshold period, e.g., every six to eight seconds, and an appropriate switching time within the threshold period may be selected based on one of the events described above.

FIG. 11 shows an exemplary embodiment 1100 for processing a stream for real-time transmission in accordance with some embodiments of the technology described herein. In FIG. 11, a frame from the right camera's video feed is shown. Z1 and Z2 represent the possible zoom levels available when processing this frame, though the options may not be so limited. In some embodiments, the processor may constantly zoom between levels Z1 and Z2 to give an impression of motion to the viewer. This frame shows two subjects determined by the processor and annotated with indicators (e.g., annotated with dots). This configuration may be represented as C: Right, F: S+1, Z: Z1 or Z2; and Penalty (Q): 0. In some embodiments, in order to determine the subjects for the frame, the processor may determine multiple subjects, rank the detected subjects, and select the two highest ranked subjects. In some embodiments, the camera feeds may be switched within a threshold period, e.g., every six to eight seconds, and an appropriate switching time within the threshold period may be selected based on one of the events described above.

FIG. 12 shows an exemplary embodiment 1200 for processing a stream for real-time transmission in accordance with some embodiments of the technology described herein. In FIG. 12, a frame from the right camera's video feed is shown. Z1 and Z2 represent the possible zoom levels available when processing this frame, though the options may not be so limited. In some embodiments, the processor may constantly zoom between levels Z1 and Z2 to give an impression of motion to the viewer. This frame shows three subjects determined by the processor and annotated with indicators (e.g., annotated with dots). This configuration may be represented as C: Right, F: S+2, Z: Z1 or Z2; and Penalty (Q): 0. In some embodiments, in order to determine the subjects for the frame, the processor may determine multiple subjects, rank the detected subjects, and select the three highest ranked subjects. In some embodiments, the camera feeds may be switched within a threshold period, e.g., every six to eight seconds, and an appropriate switching time within the threshold period may be selected based on one of the events described above.

In some embodiments, the probabilities of the camera, subject, subject framing and re-framings, and/or zooms are based on stylistic/artistic preferences. Within these probabilities, multipliers may be provided that weight some of the probabilities more heavily. Some multipliers and overriders may pertain to the speed or acceleration of the subject, whether or not someone is singing, and whether there is anyone on stage at all.

FIG. 13 is a diagram of an exemplary process 1300 for producing a real-time transmission of an event in accordance with some embodiments of the technology described herein. Process 1300 may be implemented on a system comprising a processor in communication with memory. The processor may be configured to execute instructions for an autodirection component stored in memory that cause the processor to perform process 1300.
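By way of illustration and not limitation, a weighted choice among camera, framing, and zoom options of the kind described above may be sketched as follows; the option names, base probabilities, and multiplier values are hypothetical stand-ins for the stylistic preferences and context multipliers:

    import random

    def choose_option(base_probabilities: dict, multipliers: dict) -> str:
        # Apply context multipliers (e.g. "the subject is singing" or "the
        # subject is moving fast") to the stylistic base probabilities and
        # draw one option at random according to the resulting weights.
        options = list(base_probabilities)
        weights = [base_probabilities[o] * multipliers.get(o, 1.0) for o in options]
        return random.choices(options, weights=weights, k=1)[0]

    # Example: zoom level Z2 weighted more heavily while someone is singing.
    zoom = choose_option({"Z1": 0.6, "Z2": 0.4}, {"Z2": 2.0})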

At block 1302, process 1300 begins.

At block 1304, the processor may receive a real-time stream for an artistic performance. In some embodiments, the real-time stream is captured from one or more cameras including a left camera, a right camera, and/or a center camera.

At block 1306, the processor may detect one or more human persons in the real-time stream. In some embodiments, detecting one or more human persons in the real-time stream may include detecting a human body and/or one or more mandatory parts. In some embodiments, the one or more mandatory parts may include an eye, an elbow, and a shoulder. In some embodiments, a distance of the subject from a camera capturing the real-time stream may be determined based on a size of the head of the subject.
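By way of illustration and not limitation, the mandatory-part check and the head-size distance estimate may be sketched as follows; the part names mirror the examples above, and the reference calibration values are assumptions not taken from the original description:

    def is_valid_person(detected_parts: set) -> bool:
        # Treat a detection as a person only when every mandatory part
        # (here an eye, an elbow, and a shoulder) is present.
        return {"eye", "elbow", "shoulder"}.issubset(detected_parts)

    def estimate_distance_m(head_height_px: float,
                            reference_head_px: float,
                            reference_distance_m: float) -> float:
        # Pinhole-camera approximation: apparent head size falls off roughly
        # inversely with distance, so a head half the reference size is about
        # twice the reference distance from the camera.
        return reference_distance_m * (reference_head_px / head_height_px)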

In some embodiments, a second real-time stream for the artistic performance may not be further analyzed subsequent to detecting no human person in the second real-time stream. In some embodiments, a second real-time stream for the artistic performance may be analyzed using one or more backup rules subsequent to detecting no human person in the second real-time stream.
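By way of illustration and not limitation, the two treatments of a stream with no detected person may be sketched as follows; detect_persons and backup_rules are hypothetical callables standing in for the detection step and the backup rules, respectively:

    def analyze_stream(frame, detect_persons, backup_rules=None):
        # When no person is detected, either skip further analysis of this
        # stream (return None) or fall back to the supplied backup rules,
        # e.g. a default wide framing of the stage.
        persons = detect_persons(frame)
        if not persons:
            return backup_rules(frame) if backup_rules else None
        return persons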

At block 1308, the processor may rank the detected one or more human persons in the real-time stream. In some embodiments, the detected one or more human persons may be ranked based on proximity to a camera that captures the real-time stream. In some embodiments, the detected one or more human persons may be ranked based on determining which human person is singing in the artistic performance. In some embodiments, the detected one or more human persons may be ranked based on proximity to a center of each frame in the real-time stream.
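By way of illustration and not limitation, a ranking that combines these criteria may be sketched as follows; the ordering of the criteria and the per-person fields (is_singing, head_height_px as a proxy for proximity to the camera, and the person's center coordinates) are assumptions for illustration:

    def rank_persons(persons, frame_width, frame_height):
        # Singers rank first, then larger heads (closer to the camera),
        # then persons nearer the center of the frame.
        cx, cy = frame_width / 2, frame_height / 2

        def score(p):
            dist_to_center = ((p["x"] - cx) ** 2 + (p["y"] - cy) ** 2) ** 0.5
            return (1.0 if p["is_singing"] else 0.0,
                    p["head_height_px"],
                    -dist_to_center)

        return sorted(persons, key=score, reverse=True)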

At block 1310, the processor may select, based on the ranking, a subject from the detected one or more human persons. In some embodiments, a second human person may be selected in addition to the subject, and the portion of each frame includes the subject and the second human person.

At block 1312, the processor may determine a subject framing for the real-time stream based on the selected subject. In some embodiments, determining the subject framing may include determining that one of the human persons is singing and selecting that human person as the only subject for the portion of each frame. In some embodiments, determining the subject framing may include determining that none of the human persons is singing and selecting two or more human persons as the subjects for the portion of each frame, wherein the portion of each frame includes both the subjects.
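By way of illustration and not limitation, the framing decision described above may be sketched as follows; the group size of two is used only as an example of selecting two or more subjects when no one is singing:

    def determine_subjects(ranked_persons):
        # Frame the top-ranked singer alone when someone is singing;
        # otherwise frame the two highest-ranked persons together.
        singers = [p for p in ranked_persons if p["is_singing"]]
        if singers:
            return [singers[0]]
        return ranked_persons[:2]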

At block 1314, the processor may process the real-time stream to select a portion of each frame in the real-time stream, wherein the portion of each frame includes the subject. In some embodiments, the portion of each frame may be selected based on maintaining a minimum margin between the head of the subject and an edge of the portion of the frame. In some embodiments, processing the real-time stream may further include selecting a zoom level for selecting the portion of each frame of the real-time stream. In some embodiments, processing the real-time stream may further include selecting a first zoom level for some frames of the real-time stream and a second zoom level for remaining frames of the real-time stream.
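By way of illustration and not limitation, selecting a portion of the frame at a given zoom level while keeping a minimum margin around the subject's head may be sketched as follows; the head_box layout and the centering rule are assumptions for illustration:

    def crop_with_head_margin(head_box, frame_w, frame_h, zoom, min_margin_px):
        # head_box is (x, y, w, h) of the subject's head in frame pixels.
        # Returns None when the head plus the required margin cannot fit at
        # this zoom, signalling that a wider zoom level should be chosen.
        hx, hy, hw, hh = head_box
        crop_w, crop_h = frame_w / zoom, frame_h / zoom
        if hw + 2 * min_margin_px > crop_w or hh + 2 * min_margin_px > crop_h:
            return None
        # Center the crop on the head, then clamp it inside the frame; near
        # the physical frame edge the margin is limited by the frame itself.
        x = min(max(hx + hw / 2 - crop_w / 2, 0), frame_w - crop_w)
        y = min(max(hy + hh / 2 - crop_h / 2, 0), frame_h - crop_h)
        return int(x), int(y), int(crop_w), int(crop_h)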

In some embodiments, the system may determine a penalty based on a quality of the real-time stream. In some embodiments, the quality of the real-time stream may include whether the subject is trackable, whether the subject is out of frame, and/or whether there is noise in detection of the subject.
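By way of illustration and not limitation, such a penalty may be accumulated from quality flags as in the following sketch; the individual penalty values are illustrative placeholders:

    def quality_penalty(trackable: bool, out_of_frame: bool, noisy: bool) -> float:
        # Accumulate Penalty(Q) for a candidate stream from quality flags.
        penalty = 0.0
        if not trackable:
            penalty += 1.0
        if out_of_frame:
            penalty += 2.0
        if noisy:
            penalty += 0.5
        return penalty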

At block 1316, the processor may transmit the processed stream in real-time. In some embodiments, a second real-time stream from a different camera may be selected based on a threshold time passing subsequent to an initial transmission of the processed stream. In some embodiments, a second real-time stream from a different camera may be selected based on an audio stream associated with the real-time transmission and the second real-time transmission. In some embodiments, the second real-time stream may be selected in response to presence of a bar, an amplitude intensity, and/or a singing phrase in the audio stream.
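By way of illustration and not limitation, the switching decision may be sketched as follows; the six-to-eight-second defaults mirror the threshold period mentioned earlier, and audio_event stands in for the detection of a bar boundary, an amplitude peak, or the start of a sung phrase in the audio stream:

    def should_switch(seconds_on_current: float, audio_event: bool,
                      min_hold_s: float = 6.0, max_hold_s: float = 8.0) -> bool:
        # Hold the current feed for at least min_hold_s; between the minimum
        # and maximum hold times an audio event may trigger a cut; after
        # max_hold_s a cut is forced.
        if seconds_on_current < min_hold_s:
            return False
        if seconds_on_current >= max_hold_s:
            return True
        return audio_event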

At block 1318, process 1300 ends.

FIG. 14 shows exemplary interfaces 1400, 1420, and 1440 for a software application for consuming real-time transmissions of one or more events. The software application (or app) may be implemented on a mobile device comprising a processor in communication with memory. The processor may be configured to execute instructions for the software application stored in memory.

In some embodiments, each screen in the app may represent a venue. The viewer may receive a real-time transmission of an event by navigating to a screen for the venue where the event is being performed. The venue screen and/or the real-time transmission may be available in vertical and/or horizontal orientations per the preferences of the viewer. Additionally or alternatively, the viewer may stream the real-time transmission to a bigger screen, such as a television, using AIRPLAY, CHROMECAST, or another suitable protocol.

In some embodiments, the app benefits from automated capture of live video and/or audio feeds of one or more events, from permanently or temporarily installed cameras at the venues, and real-time direction and/or editing of the feeds to generate the associated real-time transmission. For example, the viewer may switch between real-time transmissions from a venue in Brooklyn, a venue in Chicago, and another suitable venue. The viewer may receive an enhanced experience where he or she is not trapped in one venue and can experience different venues on his or her mobile device as desired. For example, interface 1400 shows options where the viewer may switch from the current performer 1402 to another performer 1404.

In some embodiments, the app allows a viewer to contribute a monetary amount to support the performer via the app. The app may allow the viewer to connect to the performer's SPOTIFY page, read the performer's WIKIPEDIA page, or access other suitable information feeds for the performer. For example, interface 1420 shows options where the viewer can read about the performer 1422 in an about section 1424. For example, interface 1440 shows options where the viewer may search for performers or events using search bar 1442, seek out featured and/or other recommended performers 1444, and obtain further information 1446 about when the related real-time transmission(s) will be streamed.

In some embodiments, the app may allow the viewer to view prior performances from the performer that were streamed in real time. In some embodiments, the app may allow the viewer to control the direction of the video feeds. For example, the viewer may select that the drummer should always be within the frame, and the autodirection system may adapt the real-time transmission for the viewer to generate a real-time transmission where the drummer is always within the frame where feasible.

Example Computer Architecture

The hardware that the described systems and methods may reside on can vary based on certain factors. Because the system uses an observer-based architecture, there is flexibility around how many different metrics may be analyzed during a given performance. If cost dictates that less robust hardware be used, the number of metrics that can be analyzed simultaneously may be reduced. The more robust the hardware available, the greater the number of metrics and media streams that may be analyzed simultaneously.

One example implementation of the described systems and methods is shown in FIG. 15. In particular, system 1500 may include one or more processors 1501 that are operable to generate a real-time transmission of an event (e.g., element 1504). Such information may be stored within memory or persisted to storage media. In some embodiments, processors 1501 may receive one or more live audio and/or video feeds 1502 captured in real time from the event. In some embodiments, processors 1501 may receive and/or generate scored metrics 1503 for each live feed according to the described systems and methods. Processors 1501 may be configured to execute the described systems and methods to generate the real-time transmission of the event 1504 based on the one or more live audio and/or video feeds 1502 and the scored metrics 1503.

An illustrative implementation of a computing device 1600 that may be used in connection with any of the embodiments of the disclosure provided herein is shown in FIG. 16. The computing device 1600 may include one or more processors 1601 and one or more articles of manufacture that comprise non-transitory computer-readable storage media (e.g., memory 1602 and one or more non-volatile storage media 1603). The processor 1601 may control writing data to and reading data from the memory 1602 and the non-volatile storage device 1603 in any suitable manner. To perform any of the functionality described herein, the processor 1601 may execute one or more processor-executable instructions stored in one or more non-transitory computer-readable storage media (e.g., the memory 1602), which may serve as non-transitory computer-readable storage media storing processor-executable instructions for execution by the processor 1601.

The terms “program” or “software” are used herein in a generic sense to refer to any type of computer code or set of processor-executable instructions that can be employed to program a computer or other processor to implement various aspects of embodiments as discussed above. Additionally, it should be appreciated that according to one aspect, one or more computer programs that when executed perform methods of the disclosure provided herein need not reside on a single computer or processor, but may be distributed in a modular fashion among different computers or processors to implement various aspects of the disclosure provided herein.

Processor-executable instructions may be in many forms, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments.

Also, data structures may be stored in one or more non-transitory computer-readable storage media in any suitable form. For simplicity of illustration, data structures may be shown to have fields that are related through location in the data structure. Such relationships may likewise be achieved by assigning storage for the fields with locations in a non-transitory computer-readable medium that convey relationship between the fields. However, any suitable mechanism may be used to establish relationships among information in fields of a data structure, including through the use of pointers, tags or other mechanisms that establish relationships among data elements.

All definitions, as defined and used herein, should be understood to control over dictionary definitions, and/or ordinary meanings of the defined terms.

As referred to herein, the term “in response to” may refer to initiated as a result of or caused by. In a first example, a first action being performed in response to a second action may include interstitial steps between the first action and the second action. In a second example, a first action being performed in response to a second action may not include interstitial steps between the first action and the second action.

As used herein in the specification and in the claims, the phrase “at least one,” in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements. This definition also allows that elements may optionally be present other than the elements specifically identified within the list of elements to which the phrase “at least one” refers, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, “at least one of A and B” (or, equivalently, “at least one of A or B,” or, equivalently “at least one of A and/or B”) can refer, in one embodiment, to at least one, optionally including more than one, A, with no B present (and optionally including elements other than B); in another embodiment, to at least one, optionally including more than one, B, with no A present (and optionally including elements other than A); in yet another embodiment, to at least one, optionally including more than one, A, and at least one, optionally including more than one, B (and optionally including other elements); etc.

The phrase “and/or,” as used herein in the specification and in the claims, should be understood to mean “either or both” of the elements so conjoined, i.e., elements that are conjunctively present in some cases and disjunctively present in other cases. Multiple elements listed with “and/or” should be construed in the same fashion, i.e., “one or more” of the elements so conjoined. Other elements may optionally be present other than the elements specifically identified by the “and/or” clause, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, a reference to “A and/or B,” when used in conjunction with open-ended language such as “comprising” can refer, in one embodiment, to A only (optionally including elements other than B); in another embodiment, to B only (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.

Use of ordinal terms such as “first,” “second,” “third,” etc., in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed. Such terms are used merely as labels to distinguish one claim element having a certain name from another element having a same name (but for use of the ordinal term).

The phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. The use of “including,” “comprising,” “having,” “containing,” “involving,” and variations thereof, is meant to encompass the items listed thereafter and additional items.

Having described several embodiments of the techniques described herein in detail, various modifications and improvements will readily occur to those skilled in the art. Such modifications and improvements are intended to be within the spirit and scope of the disclosure. Accordingly, the foregoing description is by way of example only, and is not intended as limiting. The techniques are limited only as defined by the following claims and the equivalents thereto.

What is claimed is:
1. A computer-implemented method for processing a real-time video stream, the method comprising, by a processor: receiving a first real-time video stream of an artistic performance from a first video source; receiving a second real-time video stream of the artistic performance from a second video source different than the first video source; time-synchronizing the first and second real-time video streams; detecting one or more human persons in the first real-time video stream; ranking the detected one or more human persons in the first real-time video stream; selecting, based on the ranking, a subject from the detected one or more human persons; determining a subject framing for the first real-time video stream based on the selected subject; determining a subject framing for the second real-time video stream based on the selected subject; processing the first real-time video stream and the second real-time video stream to select a portion of each frame in the first real-time video stream and the second real-time video stream according to the subject framing, wherein the portion of each frame includes at least the subject; and generating an output video stream from the first real-time video stream and the second real-time video stream based on the selected portion of each frame.
2. The method of claim 1, wherein the second video source is a smartphone camera.
3. The method of claim 1, wherein the step of time-synchronizing the first and second video streams is performed based on audio signals in the first and second video streams.
4. The method of claim 1, wherein the step of time-synchronizing the first and second real-time video streams synchronizes each of the first and second real-time video streams to a current time.
5. The method of claim 1, wherein the detected one or more human persons are ranked based on proximity to the first and/or second video sources.
6. The method of claim 1, wherein the detected one or more human persons are ranked based on a determination of which human person is speaking or singing in the artistic performance.
7. The method of claim 1, wherein the step of generating the output video stream further comprises: after passage of a threshold time subsequent to an initial transmission of the output video stream, selecting a portion of the second real-time video stream for inclusion in the output video stream.
8. The method of claim 7, further comprising: automatically selecting, by a computer processor, portions of the first real-time video stream and the second real-time video stream to include in the output video stream.
9. The method of claim 8, further comprising, for at least one selected portion of the first real-time video stream or the second real-time video stream, adjusting a zoom of the at least one selected portion prior to including the at least one selected portion in the output video stream.
10. The method of claim 8, wherein the automatic selection is based on a determination that the subject is no longer present in the selected framing of the first real-time video stream or in the selected framing of the second real-time video stream.
11. A system for generating a video stream for real-time transmission, the system comprising a processor in communication with memory, the processor being configured to execute instructions for an autodirection component stored in memory that cause the processor to: receive a first real-time video stream of an artistic performance from a first video source; receive a second real-time video stream of the artistic performance from a second video source different than the first video source; time-synchronize the first and second real-time video streams; detect one or more human persons in the first real-time video stream; rank the detected one or more human persons in the first real-time video stream; select, based on the ranking, a subject from the detected one or more human persons; determine a subject framing for the first real-time video stream based on the selected subject; determine a subject framing for the second real-time video stream based on the selected subject; process the first real-time video stream and the second real-time video stream to select a portion of each frame in the first real-time video stream and the second real-time video stream according to the subject framing, wherein the portion of each frame includes at least the subject; and generate an output video stream from the first real-time video stream and the second real-time video stream based on the selected portion of each frame.
12. The system of claim 11, wherein the second video source is a smartphone camera.
13. The system of claim 11, wherein the step of time-synchronizing the first and second video streams is performed based on audio signals in the first and second video streams.
14. The system of claim 11, wherein the step of time-synchronizing the first and second real-time video streams synchronizes each of the first and second real-time video streams to a current time.
15. The system of claim 11, wherein the detected one or more human persons are ranked based on proximity to the first and/or second video sources.
16. The system of claim 11, wherein the detected one or more human persons are ranked based on a determination of which human person is speaking or singing in the artistic performance.
17. The system of claim 11, wherein the step of generating the output video stream further comprises: after passage of a threshold time subsequent to an initial transmission of the output video stream, selecting a portion of the second real-time video stream for inclusion in the output video stream.
18. The system of claim 17, the processor further configured to: automatically select portions of the first real-time video stream and the second real-time video stream to include in the output video stream.
19. The system of claim 18, the processor further configured to, for at least one selected portion of the first real-time video stream or the second real-time video stream, adjust a zoom of the at least one selected portion prior to including the at least one selected portion in the output video stream.
20. The system of claim 18, wherein the automatic selection is based on a determination that the subject is no longer present in the selected framing of the first real-time video stream or in the selected framing of the second real-time video stream.